Second Order Dynamics Featuring Tikhonov Regularization and Time Scaling

In a Hilbert setting we aim to study a second order in time differential equation, combining viscous and Hessian-driven damping, containing a time scaling parameter function and a Tikhonov regularization term. The dynamical system is related to the problem of minimization of a nonsmooth convex function. In the formulation of the problem as well as in our analysis we use the Moreau envelope of the objective function and its gradient and heavily rely on their properties. We show that there is a setting where the newly introduced system preserves and even improves the well-known fast convergence properties of the function and Moreau envelope along the trajectories and also of the gradient of Moreau envelope due to the presence of time scaling. Moreover, in a different setting we prove strong convergence of the trajectories to the element of minimal norm from the set of all minimizers of the objective. The manuscript concludes with various numerical results.


Introduction
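In the Hilbert setting $H$, where $\langle \cdot, \cdot \rangle$ denotes the inner product and the norm is defined as usual by $\|\cdot\| = \sqrt{\langle \cdot, \cdot \rangle}$, we will study the convergence properties of the following second order in time differential equation

$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta \frac{d}{dt}\nabla \Phi_{\lambda(t)}(x(t)) + b(t)\nabla \Phi_{\lambda(t)}(x(t)) + \varepsilon(t)x(t) = 0 \qquad (1)$$

with initial conditions $x(t_0) = x_0 \in H$, $\dot{x}(t_0) = \dot{x}_0 \in H$, where $\alpha$, $\beta$ and $t_0 > 0$, $\lambda : [t_0, +\infty) \to \mathbb{R}_+$ and $b : [t_0, +\infty) \to \mathbb{R}_+$ are non-negative, non-decreasing and differentiable, $\Phi : H \to \overline{\mathbb{R}} = \mathbb{R} \cup \{\pm\infty\}$ is a proper, convex and lower semicontinuous function, $\Phi_\lambda$ is its Moreau envelope of index $\lambda > 0$, and the function $\varepsilon : [t_0, +\infty) \to \mathbb{R}_+$ is continuously differentiable and non-increasing with the property $\lim_{t \to +\infty} \varepsilon(t) = 0$. In addition, we assume that $\operatorname{argmin} \Phi$, the set of global minimizers of $\Phi$, is not empty, and we denote by $\Phi^*$ the optimal objective value of $\Phi$.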
The system (1) is connected to the problem of minimizing a proper, convex and lower semicontinuous function $\Phi$. Studying such systems provides a better understanding of their discrete counterparts, namely optimization algorithms, since there is a strong connection between them, and the question of transitioning from one to the other attracts a lot of attention in the modern literature. One of the main goals of this research is to improve (compared to [23]) the fast rates of convergence of the Moreau envelope of the objective function and of the objective function itself to $\Phi^*$, as well as the rates for the gradient of the Moreau envelope of the objective function, in terms of the Moreau parameter function $\lambda$ and the time scaling function $b$. Moreover, we also deduce the strong convergence of the trajectory of the dynamics to the minimal norm element of $\operatorname{argmin} \Phi$. We introduce two settings with different assumptions, one for each result. To conclude, we provide multiple numerical results in order to illustrate our theoretical findings.

Nonsmooth Optimization with Time Scaling
In the smooth setting the pioneering research on second order dynamical systems was conducted by Su-Boyd-Candès [30] for the sake of obtaining faster asymptotic convergence for convex functions. They managed to deduce rates of convergence of the function values of the order $\frac{1}{t^2}$. Later Attouch-Peypouquet-Redont [20] also established the weak (and in some particular cases the strong) convergence of the trajectories to a minimizer of the objective function. In [19] the same authors continued the development in this direction by adding a Hessian-driven damping term in order to obtain rates for the gradient of the objective function and to eliminate possible oscillations in the dynamical behaviour of the trajectories.
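The $\frac{1}{t^2}$ rate mentioned above can be observed numerically. The following is only a minimal sketch under stated assumptions: the test function $f(x) = x^2/2$, the starting data, the step size and the helper name `simulate_avd` are illustrative choices and are not taken from [30]. The code integrates $\ddot{x}(t) + \frac{3}{t}\dot{x}(t) + \nabla f(x(t)) = 0$ with a classical Runge-Kutta scheme and tracks $t^2 (f(x(t)) - f^*)$, which the standard Lyapunov argument bounds by the initial energy $t_0^2 (f(x_0) - f^*) + 2\|x_0 + \tfrac{t_0}{2}\dot{x}_0 - x^*\|^2$.

```python
def simulate_avd(alpha=3.0, x0=1.0, v0=0.0, t0=1.0, t_end=50.0, h=1e-3):
    """Integrate x'' + (alpha/t) x' + f'(x) = 0 for f(x) = x^2 / 2 with RK4.

    Returns the final position and the largest observed value of t^2 * f(x(t)).
    All parameter values are illustrative assumptions.
    """
    def rhs(t, x, v):
        # First-order form: x' = v, v' = -(alpha/t) v - f'(x), with f'(x) = x.
        return v, -(alpha / t) * v - x

    x, v = x0, v0
    worst = t0 * t0 * 0.5 * x0 * x0
    steps = round((t_end - t0) / h)
    for i in range(steps):
        t = t0 + i * h
        k1x, k1v = rhs(t, x, v)
        k2x, k2v = rhs(t + h / 2, x + h / 2 * k1x, v + h / 2 * k1v)
        k3x, k3v = rhs(t + h / 2, x + h / 2 * k2x, v + h / 2 * k2v)
        k4x, k4v = rhs(t + h, x + h * k3x, v + h * k3v)
        x += h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v += h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        t_next = t0 + (i + 1) * h
        worst = max(worst, t_next * t_next * 0.5 * x * x)
    return x, worst


x_end, worst_scaled_gap = simulate_avd()
# Initial energy for x0 = 1, v0 = 0, t0 = 1, x* = 0: 1 * 0.5 + 2 * 1 = 2.5.
energy_bound = 0.5 + 2.0
```

In this run `worst_scaled_gap` stays below `energy_bound`, which is the numerical counterpart of $f(x(t)) - f^* = O(1/t^2)$ for this particular example.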
Concerning the nonsmooth setting we must point out that the Moreau envelope of a proper, convex and lower semicontinuous function $\Phi : H \to \overline{\mathbb{R}}$ has proved to be of significant importance in designing continuous-time approaches and numerical algorithms for the minimization of nonsmooth functions. The rigorous definition of this construction is

$$\Phi_\lambda(x) = \inf_{y \in H} \left\{ \Phi(y) + \frac{1}{2\lambda}\|x - y\|^2 \right\},$$

where $\lambda > 0$ is the parameter of the Moreau envelope (see, for instance, [21]). One of the most important properties of the Moreau approximation is that for every $\lambda > 0$ the functions $\Phi$ and $\Phi_\lambda$ share the same optimal objective value and also the same set of minimizers. Moreover, $\Phi_\lambda$ is convex and continuously differentiable with

$$\nabla \Phi_\lambda(x) = \frac{1}{\lambda}\left(x - \operatorname{prox}_{\lambda\Phi}(x)\right),$$

and $\nabla \Phi_\lambda$ is $\frac{1}{\lambda}$-Lipschitz continuous, where

$$\operatorname{prox}_{\lambda\Phi}(x) = \operatorname*{argmin}_{y \in H} \left\{ \Phi(y) + \frac{1}{2\lambda}\|x - y\|^2 \right\}$$

denotes the proximal operator of $\Phi$ of parameter $\lambda$. The last fact we would like to mention is that for every $x \in H$ the function $\lambda \in (0, +\infty) \mapsto \Phi_\lambda(x)$ is nonincreasing and differentiable (see [14], Lemma A1), namely,

$$\frac{d}{d\lambda}\Phi_\lambda(x) = -\frac{1}{2}\|\nabla \Phi_\lambda(x)\|^2.$$

Our research is a logical continuation of the one conducted in [24], where the authors applied the time rescaling technique to a nonsmooth optimization problem (for more information on time scaling see also [5,10,11,13]). They considered the following system

$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta(t) \frac{d}{dt}\nabla \Phi_{\lambda(t)}(x(t)) + b(t)\nabla \Phi_{\lambda(t)}(x(t)) = 0,$$

where $\alpha \geq 1$, $t_0 > 0$, and $\beta : [t_0, +\infty) \to [0, +\infty)$ and $b, \lambda : [t_0, +\infty) \to (0, +\infty)$ are differentiable functions. On the one hand, the presence of the Hessian damping term is believed to help reduce the oscillations in the dynamical behaviour and provides rates for the gradient of the objective function $\Phi$. On the other hand, the time-scaling technique (which is considered to be an artificial way to speed up the convergence of the values) affects the convergence rates while bringing more restrictions to the analysis. Rates for the Moreau envelope values were established there, from which, through the proximal mapping, the convergence rates for the objective function $\Phi$ itself along the trajectory were obtained. Note that by taking $b(\cdot) \equiv 1$ we arrive at the well-known convergence rate of the values of the order $o\left(\frac{1}{t^2}\right)$. In addition, rates for the gradient of the Moreau envelope were deduced. Finally, the weak convergence of the trajectories $x(t)$ to a minimizer of $\Phi$ as $t \to +\infty$ was obtained.
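The properties of the Moreau envelope recalled above can be checked on a one-dimensional example. This is a minimal sketch, assuming the illustrative objective $\Phi(x) = |x|$ (not an example from the paper); the helper names are hypothetical. For this choice the proximal operator is soft-thresholding and the envelope is the Huber function, so the closed-form envelope can be compared with a brute-force evaluation of the infimum on a grid.

```python
def prox_abs(x, lam):
    """Proximal operator of lam * |.| (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_abs(x, lam):
    """Moreau envelope of |.| evaluated through the proximal point."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2 * lam)


def grad_moreau_abs(x, lam):
    """Gradient of the envelope: (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


def moreau_brute(x, lam, radius=5.0, n=20001):
    """Grid approximation of inf_y { |y| + (x - y)^2 / (2 lam) }."""
    best = float("inf")
    for i in range(n):
        y = -radius + 2 * radius * i / (n - 1)
        best = min(best, abs(y) + (x - y) ** 2 / (2 * lam))
    return best


def dlam_numeric(x, lam, h=1e-6):
    """Central finite difference of lam -> envelope value at fixed x."""
    return (moreau_abs(x, lam + h) - moreau_abs(x, lam - h)) / (2 * h)
```

With these helpers one can verify, away from the kink $|x| = \lambda$, that `dlam_numeric(x, lam)` agrees with `-0.5 * grad_moreau_abs(x, lam) ** 2`, and that `moreau_abs(0.0, lam)` equals the minimal value $0$ of $|\cdot|$ for every `lam`.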
In our analysis we borrow some ideas from [24] and develop them further in order to fit the new setting, namely, to adapt to the presence of an entirely new term, the Tikhonov regularization. The analysis becomes more involved and technical, and some fundamental properties of the Tikhonov regularization had to be proved for the nonsmooth setting. Its presence affects the set of conditions which we have to impose on the system parameters: even though some of the conditions are formulated in the same spirit as in [24] (for instance, (11) and (14)), the other ones are completely new due to the presence of the Tikhonov term. Moreover, depending on how fast $\varepsilon$ decays, two different settings arise, providing different fundamental results (Sects. 3 and 4).

Tikhonov Regularization
It turns out that having an additional term with specific properties in the system equation improves the weak convergence of the trajectories to a minimizer of the objective function to strong convergence to the element of minimal norm of $\operatorname{argmin} \Phi$. Such systems were studied, for instance, in [4,6,9,12,17,23,27]. The main goal of such research is to show that these systems preserve all the typical properties of second order in time dynamical systems (fast convergence of the values, rates for the gradient, etc.) and that, moreover, the trajectories converge strongly to the minimal norm solution instead of weakly to an arbitrary minimizer. One of the many examples of such systems is presented in [23], where $\alpha \geq 3$, $t_0 > 0$, $\Phi : H \to \mathbb{R}$ is twice continuously differentiable and convex, and, for the rest of the section, the function $\varepsilon : [t_0, +\infty) \to \mathbb{R}_+$ is continuously differentiable and non-increasing with the property $\lim_{t \to +\infty} \varepsilon(t) = 0$. In that manuscript the authors provided two settings: one for the fast convergence of the values together with the weak convergence of the trajectories to a minimizer of $\Phi$, and another one for the strong convergence of $x$ to $x^*$ as $t \to +\infty$.
Another fine example is given in [4], where $\alpha, t_0 > 0$ and $\Phi : H \to \mathbb{R}$ is continuously differentiable and convex. In that paper the authors obtained rates for the function values $\Phi(x(t)) - \Phi^*$, as well as for the quantity $\|x(t) - x_{\varepsilon(t)}\|$, as $t \to +\infty$, where $x_{\varepsilon(t)}$ denotes the unique minimizer of the strongly convex function $\Phi + \frac{\varepsilon(t)}{2}\|\cdot\|^2$. Thus, they ensured the strong convergence of the trajectories to the minimal norm solution $x^* = \operatorname{proj}_{\operatorname{argmin} \Phi}(0)$ under appropriate assumptions and with a properly chosen energy functional, using the properties of the Tikhonov regularization. The most important feature of this approach is that the authors were able to establish fast convergence of the values and strong convergence of the trajectories in the very same setting.
The next step was done in [6], where $\Phi : H \to \mathbb{R}$ is twice continuously differentiable and convex and $p \in [0, 1]$. This system, while preserving all the properties of (5), additionally provides an integral estimate for the norm of the gradient of $\varphi_t$.
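The selection effect of the Tikhonov term discussed in this section can be illustrated in a static, finite-dimensional setting. The following is a minimal sketch under stated assumptions: the objective $\Phi(x_1, x_2) = \frac{1}{2}(x_1 + x_2 - 2)^2$, the starting point and the helper names are illustrative and do not come from the cited works. Its set of minimizers is the line $x_1 + x_2 = 2$, whose minimal norm element is $(1, 1)$. Plain gradient descent converges to the minimizer closest to the starting point, whereas the minimizer of $\Phi + \frac{\varepsilon}{2}\|\cdot\|^2$ approaches $(1, 1)$ as $\varepsilon \to 0$.

```python
def grad_phi(x):
    """Gradient of the illustrative objective Phi(x1, x2) = 0.5 * (x1 + x2 - 2)^2."""
    r = x[0] + x[1] - 2.0
    return (r, r)


def minimize(eps, x0=(3.0, 0.0), iters=20000):
    """Gradient descent on Phi(x) + (eps / 2) * ||x||^2 with step 1 / L, L = 2 + eps."""
    step = 1.0 / (2.0 + eps)
    x = list(x0)
    for _ in range(iters):
        g = grad_phi(x)
        x = [x[i] - step * (g[i] + eps * x[i]) for i in range(2)]
    return tuple(x)


def dist_to_min_norm(x):
    """Euclidean distance to the minimal norm minimizer (1, 1)."""
    return ((x[0] - 1.0) ** 2 + (x[1] - 1.0) ** 2) ** 0.5


unregularized = minimize(0.0)                       # projection of x0 onto the solution line
regularized = [minimize(eps) for eps in (1.0, 0.1, 0.01)]
```

For this example the regularized minimizer is available in closed form, $x_\varepsilon = \frac{2}{2 + \varepsilon}(1, 1)$, so $\|x_\varepsilon\| \leq \|(1, 1)\|$ and $x_\varepsilon \to (1, 1)$ as $\varepsilon \to 0$, while `unregularized` stays at the non-minimal-norm solution $(2.5, -0.5)$.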

Our Contribution
In this paper we will develop the ideas presented in [23] to cover the nonsmooth case with time scaling. We will obtain the fast convergence of the function values (as well as of the gradient of the Moreau envelope of the objective function $\Phi$) for the family of dynamical systems (1) governed by the Moreau envelope of the nonsmooth function $\Phi$ and having the Tikhonov term in their formulation: rates for the Moreau envelope, rates in terms of the function $\Phi$ itself, and finally rates for the gradient of the Moreau envelope, as $t \to +\infty$.
We will also deduce (under some appropriate conditions) the following result: $\liminf_{t \to +\infty} \|x(t) - x^*\| = 0$, which under some restrictions will be improved to the full strong convergence of the trajectories of (1) to the minimal norm solution.
The paper is organized in the following way. Section 2 is devoted to some preliminary results, which we will need later. We will establish the fast rates of convergence of the function values and of its Moreau envelope, as well as of the gradient of the Moreau envelope, along the trajectories of the dynamical system (Sect. 3). We will show that under some assumptions the strong convergence of the trajectories to the element of minimal norm from the set of all minimizers of the objective function takes place (Sect. 4). We will provide two settings for the polynomial choice of parameter functions which fulfill the assumptions made throughout the analysis (Sect. 5) and equip this manuscript with various numerical results (Sect. 6).

Preparatory Results
We start with the following lemma (see [21], Proposition 12.22, for the first term of the lemma and [18], Appendix, A1, for the second one).
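Lemma 1 Let $\Phi : H \to \overline{\mathbb{R}}$ be a proper, convex and lower semicontinuous function, $\lambda, \mu > 0$. Then

Let us mention two key properties of the Tikhonov regularization, which we will use later in the analysis (see, for instance, [2] or [21], Theorem 23.44, for its classic analogue). First let us introduce the strongly convex function

$$\varphi_{\varepsilon(t),\lambda(t)}(x) = \Phi_{\lambda(t)}(x) + \frac{\varepsilon(t)}{2}\|x\|^2$$

and denote the unique minimizer of $\varphi_{\varepsilon(t),\lambda(t)}$ by $x_{\varepsilon(t),\lambda(t)} = \operatorname{argmin}_H \varphi_{\varepsilon(t),\lambda(t)}$. Thus, the first order optimality condition reads as

$$\nabla \Phi_{\lambda(t)}(x_{\varepsilon(t),\lambda(t)}) + \varepsilon(t) x_{\varepsilon(t),\lambda(t)} = 0. \qquad (6)$$

Now we are ready to formulate the following result: the mapping $t \mapsto x_{\varepsilon(t),\lambda(t)}$ satisfies $\|x_{\varepsilon(t),\lambda(t)}\| \leq \|x^*\|$ for all $t \geq t_0$ and $\lim_{t \to +\infty} x_{\varepsilon(t),\lambda(t)} = x^*$, where $x^*$ is the element of minimal norm of $\operatorname{argmin} \Phi$.

Proof By the monotonicity of $\nabla \Phi_{\lambda(t)}$ and $\nabla \Phi_{\lambda(t)}(x^*) = 0$ we deduce

$$\left\langle \nabla \Phi_{\lambda(t)}(x_{\varepsilon(t),\lambda(t)}), x_{\varepsilon(t),\lambda(t)} - x^* \right\rangle \geq 0.$$

By (6) we obtain

$$\left\langle -\varepsilon(t) x_{\varepsilon(t),\lambda(t)}, x_{\varepsilon(t),\lambda(t)} - x^* \right\rangle \geq 0, \quad \text{that is,} \quad \|x_{\varepsilon(t),\lambda(t)}\|^2 \leq \left\langle x_{\varepsilon(t),\lambda(t)}, x^* \right\rangle.$$

Using the Cauchy-Schwarz inequality we derive $\|x_{\varepsilon(t),\lambda(t)}\| \leq \|x^*\|$. This proves the first claim. For the second one, consider (6) again in its equivalent form in terms of the proximal mapping; the rest of the proof goes in line with Theorem 23.44 of [21].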
Our next goal is to deduce the existence and uniqueness of the solutions of the dynamical system (1). Suppose $\beta > 0$. Let us integrate (1) from $t_0$ to $t$ to obtain we deduce that (1) is equivalent to Let us multiply the first line by the function $b$ and the second one by the constant $\beta$ and then sum them up to get rid of the gradient of the Moreau envelope in the second equation. We denote now $y(t) = \beta z(t) + \left(b(t) - \frac{\alpha\beta}{t}\right) x(t)$, and, after simplification, we obtain the following equivalent formulation for the dynamical system In case $\beta = 0$, for every $t \geq t_0$, (1) can be equivalently written as

Based on the two reformulations of the dynamical system (1) we formulate the following existence and uniqueness result, which is a consequence of the Cauchy-Lipschitz theorem for strong global solutions. The result can be proved along the lines of the proofs of Theorem 1 in [16] or of Theorem 1.1 in [19] with some small adjustments.

Fast Convergence Rates of the Function and Moreau Envelope Values
This section is devoted to obtaining the rates of convergence for the Moreau envelope values and for the values of the function $\Phi$ itself. We will heavily rely on the tools and techniques provided by the Lyapunov analysis. We introduce a slightly modified energy function from [23]. For $2 \leq q \leq \alpha - 1$ we define The key assumptions which are essential to our analysis are the following: for all $t \geq t_0$ and

Theorem 4 Suppose $\alpha \geq 3$ and assume that (11), (12), (13), (14) hold for all $t \geq t_0$. Then Moreover, one has for all $a \geq 1$ If, in addition, $\alpha > 3$ and (15) holds, then the trajectory $x$ is bounded and

Proof Let us compute the time derivative of the energy function. For every $t \geq t_0$ using (3) we derive By (14) one has $b(t) - \frac{\beta}{t} > 0$ for all $t \geq t_0$, and thus for a strongly convex function with a quadratic term in $\|x\|^2$ we have Therefore, for every $t \geq t_0$ Notice that for $a \geq 1$ which leads to for every $t \geq t_0$. Note that $b(t) - \frac{1}{a} > 0$ for all $t \geq t_0$. Then, due to the properties of $b$, there exists $t^* \geq t_0$ such that $t^2 b(t) - \beta(q + 2 - \alpha)t \geq 0$ for all $t \geq t^*$ and all $q \in (2, \alpha - 1]$. Therefore, since $\lambda(t) \geq 0$ for all $t \geq t_0$, there exists $t^{**}$, namely, Consider now two cases with $t \geq t^{**}$. First, take $q = \alpha - 1$ to obtain from (16) for every $t \geq t_0$. Under the assumptions (11) and (12) we conclude starting from $t^{**}$ that Under the assumption (13), using the fact that $t \mapsto E_{\alpha-1}(t)$ is bounded from below, we deduce the existence of the limit $\lim_{t \to +\infty} E_{\alpha-1}(t)$ due to Lemma A.1 and, therefore, $t \mapsto E_{\alpha-1}(t)$ is bounded, which leads to , as $t \to +\infty$.

Integrating (19) one may additionally obtain the integrability of $t \|\dot{x}(t)\|^2$. From the integrability of The next theorem shows that we can actually improve the rates of convergence of the function values in case $\alpha > 3$.
Theorem 5 Assume that $\alpha > 3$ and (12), (13), (14) and (15) hold. In addition, $\lim_{t \to +\infty} \psi(t) = 0$, where for $2 \leq q \leq \alpha - 1$ which in particular means and moreover,

Proof (i) Let us first prove an auxiliary estimate (20), which will allow us to obtain the rest of the desired results. We return to Under condition (12) we deduce starting from Integrating the last inequality on $[t_0, t]$ we obtain Since the gradient $\nabla \Phi_\lambda$ is monotone, we know that Notice that by (15) we have Obviously, Introducing $\delta_1 = \alpha - 1 - \delta > 0$ (by the choice of $\delta$) we obtain From Theorem 4 we know that $t b(t)\left(\Phi_{\lambda(t)}(x(t)) - \Phi^*\right)$ is integrable and therefore so is Since the function $t \mapsto E_q(t)$ is bounded and the rest of the right hand side of (22) belongs to $L^1([t_0, +\infty), \mathbb{R})$ by Theorem 4 and (23), we conclude with (20) due to (14).
(ii) In order to derive the convergence rates for the quantities of interest we require some additional results. Our next goal is to establish the existence of the limits Consider (as was done in [23,24]) for two different $q_1, q_2 \in (2, \alpha - 1)$ and for every $t \geq t_0$ the difference As we have established earlier in Theorem 4, the limits of $E_{q_1}(t) - E_{q_2}(t)$ and $t\left(\Phi_{\lambda(t)}(x(t)) - \Phi^*\right)$ exist (the latter is actually zero). Therefore, the limit Let us introduce for every $t \geq t_0$ two auxiliary functions Noticing that we may write for every $t \geq t_0$ From the fact that $\lim_{t \to +\infty} k(t)$ exists, using (20) we obtain that $\lim_{t \to +\infty} \left((\alpha - 1) r(t) + t \dot{r}(t)\right)$ also exists. Applying Lemma A.2 we deduce the existence of the limit $\lim_{t \to +\infty} r(t)$. Using (20) again we obtain the existence of the limits $\lim_{t \to +\infty} \|x(t) - x^*\|$ and $\lim_{t \to +\infty} t \left\langle \dot{x}(t) + \beta \nabla \Phi_{\lambda(t)}(x(t)), x(t) - x^* \right\rangle$.

(iii) Finally, we are in position to prove (21) and the rest of the convergence rates. The key idea is to show that the limit exists and is actually zero. Let us return to the definition of our energy functional and rewrite it as Since the limits $\lim_{t \to +\infty} E_q(t)$ and $\lim$ exists as well. Denote Let us show that the right hand side of (24) is integrable. Indeed, the first term is integrable by Theorem 4. As we have also established in Theorem 4, starting from $t^{**}$ where $a \geq 1$. Then, by (14) and $\lambda(t) \geq 0$ for all $t \geq t_0$, we deduce that there exists $t_1 \geq t^{**}$ such that for all $t \geq t_1$ So, by Theorem 4 the right hand side of (24) belongs to $L^1([t_1, +\infty), \mathbb{R})$. Therefore, the left hand side of (24) also belongs to $L^1([t_1, +\infty), \mathbb{R})$ and, since the limit $\lim_{t \to +\infty} \psi(t)$ exists, we deduce that it should actually be zero, which gives us (21). To complete the proof notice that by the definition of the proximal mapping, we have The conclusion follows immediately from (2) and (21).

Strong Convergence of the Trajectories
In this section we will establish the strong convergence of the trajectories to the minimal norm element of $\operatorname{argmin} \Phi$.
In order to do so, we will need to modify assumption (13) from the previous section. Before moving to the main point of the section, let us first prove an auxiliary result.

Let us integrate the last inequality on [T, t]
On the other hand, for every $t \geq t_0$ Thus, We deduce due to (25) and Lemma A.3 that Thus, we establish By the definition of the proximal mapping Using the fact that $\lambda$ is bounded for all $t \geq t_0$ we deduce

For the remaining part of this section we will use a different energy functional. Inspired by [23] we introduce the following functional, which we will heavily rely on throughout this section where $v(t) = q(x(t) - x^*) + t\left(\dot{x}(t) + \beta \nabla \Phi_{\lambda(t)}(x(t))\right)$ and $p, q \geq 0$. The proof of the following theorem draws inspiration from [10,15,23].
Proof As in [23] we will consider several cases with respect to the trajectory $x$ staying either inside or outside the ball $B(0, \|x^*\|)$.

Case I.
Assume that the trajectory $x$ stays in the complement of the ball $B(0, \|x^*\|)$ for all $t \geq t_0$. This means nothing but $\|x(t)\| \geq \|x^*\|$ for every $t \geq t_0$.
(i) Our nearest goal is to obtain the upper bound for the derivative of E p,q .In order to do so, let us evaluate its time derivative for every t ≥ t 0 first.
Consider for every $t \geq t_0$ the inner product $\langle \dot{v}(t), v(t) \rangle$: where above we used (1). Consider now for every $t \geq t_0$, The two estimates that we made above lead to (31) becoming Let us apply the gradient inequality to the strongly convex function and thus for every $t \geq t_0$. So, noticing that we deduce In order to proceed further we will need the following estimates: for every $t \geq t_0$, some $c \geq 1$ and $a \geq 1$. Thus, Let us fix $p = \frac{\alpha - 3}{3}$ and $q = \frac{2\alpha}{3}$. First of all, due to this choice $q + 1 - \alpha + p = 0$ and thus we get rid of the term $\langle \dot{x}(t), x(t) - x^* \rangle$. Secondly, Then So, Obviously, for $t$ large enough, say, $t \geq t_2 \geq t_0$ the following expression is non-positive due to (27) and Moreover, from (28) it follows that for $c = 1$ So, under the assumption (12) and the fact that $\|x(t)\| \geq \|x^*\|$ for all $t \geq t_0$ we deduce due to (33) Thus, under the assumptions (12), (27), (28) and (29) (the latter leads to the nonpositivity of the coefficient of $\|\nabla \Phi_\lambda(x)\|^2$) we conclude due to (32) that for every

(ii) Let us now obtain the lower bound for $E_{p,q}$. Notice that for $p = \frac{\alpha - 3}{3}$ and $q = \frac{2\alpha}{3}$ we have $\alpha - p - q = 1$ and since $t b(t) - \beta \geq \frac{t}{2}$ for every $t \geq t_0$ by $b(t_0) \geq \frac{1}{2} + \frac{\beta}{t_0}$ and $b$ being non-decreasing. On the other hand, applying the gradient inequality to the strongly convex function By the definition of $\varphi_{\varepsilon(t),\lambda(t)}(x)$ we deduce We may now add the last two inequalities to obtain Note that due to (28) Since $\alpha > 3$ we deduce Finally, by (9) and (30)

From this and the weak convergence of the trajectory $x$ follows the strong one: $\lim_{t \to +\infty} x(t) = x^*$.

Case III.
Assume that for $t \geq t_0$ the trajectory $x$ finds itself both inside and outside the ball $B(0, \|x^*\|)$. Since $x$ is continuous, there exists a sequence $\{t_n\}_{n \in \mathbb{N}} \subseteq [t_0, +\infty)$ such that $t_n \to \infty$ as $n \to \infty$ and $\|x(t_n)\| = \|x^*\|$ for every $n \in \mathbb{N}$. Consider again a weak sequential cluster point $x$ of the sequence $\{x(t_n)\}_{n \in \mathbb{N}}$. By repeating the same argument as in the previous case we deduce the weak convergence of $\{x(t_n)\}_{n \in \mathbb{N}}$ to $x^*$ as $n \to \infty$. Since $\|x(t_n)\| \to \|x^*\|$ as $n \to \infty$, we obtain that $\|x(t_n) - x^*\| \to 0$ as $n \to \infty$, which means $\liminf_{t \to +\infty} \|x(t) - x^*\| = 0$.

Remark 1 In this section the condition $\dot{b}(t) \geq 0$ for all $t \geq t_0$ is not necessary. Our conjecture is that we can weaken the setting by omitting this condition and thus widen the range for $b$, including functions that decay not faster than $\frac{1}{t^2}$ for the polynomial choice of parameters.
Remark 2 There is no setting which guarantees both fast rates for the values and strong convergence of the trajectories. One of the future goals would be to develop a new approach (based on [6]) which would help us deduce these two results simultaneously.

Strong Convergence of the Trajectories in Case α = 3
Throughout this section we no longer require that b is non-decreasing.In this case the analogue of Theorem 6 looks as follows.
Theorem 8 Suppose that for all $t \geq t_0$ the function $\lambda$ is bounded, $b(t) \equiv b > 0$ is a constant function and (12) and (14) hold. Suppose additionally that (25) holds.

Proof In this case the energy functional becomes Relation (16) thus becomes for all $t \geq t_0$ Thus, repeating the same arguments as in Theorem 4 we obtain Let us multiply this expression by $t(bt - \beta)$ to obtain Now, we divide by $(bt - \beta)^2$ to conclude Integrating the last inequality on $[T, t]$, where $T \geq t_0$, we deduce By the definition of $E_2$ we know Combining these two inequalities, we deduce Applying Lemma A.3 we deduce due to (25) Therefore, we establish

2. For the strong convergence of the trajectories we require the following for all $t \geq t_0$: We will analyse these conditions in detail for the polynomial choice of functions $b$ and $\varepsilon$, namely, $b(t) = b t^n$ and $\varepsilon(t) = \frac{\varepsilon}{t^d}$, where $b$ is positive, $n \geq 0$ and $\varepsilon, d > 0$.

Setting for the Fast Convergence Rates of the Function Values
The set of the conditions becomes, for all $t \geq t_0$:
1. $\alpha > 3$;
2. there exists $a \geq 1$ such that $-\frac{2d\varepsilon}{t^{d+1}} \leq -\frac{a\beta\varepsilon^2}{t^{2d}}$;
3. $b(t_0) \geq \frac{\beta}{t_0}$ and $b(t_0) > \frac{1}{a}$;
4.

After some simple algebraic computations one may discover that in order to satisfy all the conditions at the same time it is enough to assume

First, let us take different time scaling parameters $b$ with $l = 0$ and $d = 3$ and see how this affects the behaviour of the system (1) (see Fig. 1).
As expected, the faster $b$ grows, the faster the convergence is. Consider now different Moreau envelope parameters $\lambda$ with $d = 3$ and $n = 0$ (see Fig. 2).
Note that the difference in the starting point comes from the fact that $t_0 = 1$, and for different exponents $l$ the value $t_0^l$ is also different. As predicted by the theory, a faster growing function $\lambda$ leads to faster convergence not only of the gradient of the Moreau envelope of the objective function $\Phi$, but also of the values of the Moreau envelope themselves.
Varying the Tikhonov function ε for n = 0 and l = 0 does not affect the system, which is illustrated by the following plot (see Fig. 3).
The set $\operatorname{argmin} \Phi$ is nothing but the segment $[-1, 1]$, and $0$ is its element of minimal norm. Let us fix $\alpha = 6$ and $n = 0.7$. First we take a constant $\lambda$ ($\lambda(t) = 1$ for all $t \geq t_0$) and plot the behaviour of the trajectories of (1) with and without the Tikhonov term (see Fig. 4).
As we see, when there is no Tikhonov regularization the trajectories converge to the minimizer $1$ of $\Phi$, whereas the Tikhonov term actually guarantees the convergence towards the minimal norm solution, which is $0$.
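The qualitative behaviour described above can be reproduced with a short simulation. The following is only a sketch under stated assumptions, not the code used for the figures: the objective $\Phi(x) = \max(|x| - 1, 0)$ (whose set of minimizers is $[-1, 1]$), the values $\beta = 1$, $x_0 = 3$, $\dot{x}_0 = 0$, $t_0 = 1$, the Tikhonov function $\varepsilon(t) = 1/t$, the step size and all helper names are hypothetical choices; only $\alpha = 6$, $n = 0.7$ and $\lambda \equiv 1$ are taken from the text. The system is assumed to have the form $\ddot{x} + \frac{\alpha}{t}\dot{x} + \beta\frac{d}{dt}\nabla\Phi_{\lambda}(x) + t^n \nabla\Phi_{\lambda}(x) + \varepsilon(t)x = 0$. To avoid differentiating the gradient, the code uses the auxiliary variable $u = \dot{x} + \beta\nabla\Phi_\lambda(x)$, which turns the equation into a first order system integrated by a classical Runge-Kutta scheme.

```python
import math


def grad_envelope(x, lam):
    """Gradient of the Moreau envelope of the assumed Phi(x) = max(|x| - 1, 0)."""
    if abs(x) <= 1.0:
        prox = x                                # already a minimizer of Phi
    elif abs(x) <= 1.0 + lam:
        prox = math.copysign(1.0, x)            # prox lands on the boundary of [-1, 1]
    else:
        prox = x - math.copysign(lam, x)        # prox stays outside [-1, 1]
    return (x - prox) / lam


def simulate(eps, alpha=6.0, beta=1.0, n=0.7, lam=1.0,
             x0=3.0, v0=0.0, t0=1.0, t_end=100.0, h=0.005):
    """Integrate the assumed system with RK4 in the variables (x, u)."""
    def rhs(t, x, u):
        g = grad_envelope(x, lam)
        dx = u - beta * g                       # x' = u - beta * grad
        du = -(alpha / t) * dx - (t ** n) * g - eps(t) * x
        return dx, du

    x, u = x0, v0 + beta * grad_envelope(x0, lam)
    steps = round((t_end - t0) / h)
    for i in range(steps):
        t = t0 + i * h
        k1 = rhs(t, x, u)
        k2 = rhs(t + h / 2, x + h / 2 * k1[0], u + h / 2 * k1[1])
        k3 = rhs(t + h / 2, x + h / 2 * k2[0], u + h / 2 * k2[1])
        k4 = rhs(t + h, x + h * k3[0], u + h * k3[1])
        x += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        u += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return x


x_plain = simulate(lambda t: 0.0)          # no Tikhonov term
x_tikhonov = simulate(lambda t: 1.0 / t)   # assumed eps(t) = 1 / t
```

In this sketch `x_plain` ends up at some point of $[-1, 1]$ determined by the initial data, while `x_tikhonov` is driven towards the minimal norm minimizer $0$.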
Another comparison was made for a non-constant $\lambda$: $\lambda(t) = 1 - \frac{1}{t^l}$ for $l = 1$ (for different values of $l$ the picture is the same), illustrating similar behaviour (see Fig. 5).
Finally, for the same choice of λ let us take different Tikhonov terms to figure out how changing them affects the trajectories of (1) (see Fig. 6).
We see that the faster $\varepsilon$ decays, the slower the trajectories converge.

Appendix A
Let us state here some auxiliary lemmas which we used in our analysis.For the proof of the following lemma we refer to [3].
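Lemma A.1 Suppose that $f : [t_0, +\infty) \to \mathbb{R}$ is locally absolutely continuous and bounded from below and there exists $g \in L^1([t_0, +\infty), \mathbb{R})$ such that for almost all $t \geq t_0$ it holds $\frac{d}{dt} f(t) \leq g(t)$. Then the limit $\lim_{t \to +\infty} f(t)$ exists.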
For the proof of the next lemma we refer to [19].
For the proof of the final Lemma we refer to [9].

Fig. 5 The role of the Tikhonov term
Case II.
Assume now the opposite of the first case, namely, $\|x(t)\| < \|x^*\|$ for every $t \geq t_0$. Considering a sequence $\{t_k\}_{k \in \mathbb{N}}$ such that $\{x(t_k)\}_{k \in \mathbb{N}}$ converges weakly to an element $x \in H$ as $k \to \infty$, we notice that $\{\xi(t_k)\}_{k \in \mathbb{N}}$ converges weakly to $x$ as $k \to \infty$. Now, the function $\Phi$, being convex and lower semicontinuous in the weak topology, allows us to write